

Search for: All records

Creators/Authors contains: "Liu, Yalin"


  1. Recent breakthroughs in deep-learning (DL) approaches have resulted in the dynamic generation of trace links that are far more accurate than was previously possible. However, DL-generated links lack clear explanations, so non-experts in the domain can find it difficult to understand the underlying semantics of a link, making it hard for them to evaluate the link's correctness or suitability for a specific software engineering task. In this paper we present a novel NLP pipeline for generating and visualizing trace link explanations. Our approach identifies domain-specific concepts, retrieves a corpus of concept-related sentences, mines concept definitions and usage examples, and identifies relations between cross-artifact concepts in order to explain the links. It applies a post-processing step to prioritize the most likely acronyms and definitions and to eliminate non-relevant ones. We evaluate our approach using project artifacts from three different domains (interstellar telescopes, positive train control, and electronic healthcare systems) and report the coverage, correctness, and potential utility of the generated definitions. We design and utilize an explanation interface that leverages concept definitions and relations to visualize and explain trace link rationales, and we report results from a user study conducted to evaluate the effectiveness of the explanation interface. Results show that the explanations presented in the interface helped non-experts to understand the underlying semantics of a trace link and improved their ability to vet the correctness of the link.
  2. Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by the availability of labeled data and by runtime efficiency. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy that enables trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated the accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-world OSS projects, the best T-BERT consistently outperformed the VSM model, with an average improvement of 60.31% measured using Mean Average Precision (MAP). RNN severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pretrained language models and transfer learning.
  4. Software traceability provides support for various engineering activities including Program Comprehension; however, it can be challenging and arduous to complete in large industrial projects. Researchers have proposed automated traceability techniques to create, maintain, and leverage trace links. Computationally intensive techniques, such as repository mining and deep learning, have shown the capability to deliver accurate trace links. The objective of achieving trusted, automated tracing techniques at industrial scale has not yet been accomplished due to practical performance challenges. This paper evaluates high-performance solutions for deploying effective, computationally expensive traceability algorithms in large-scale industrial projects and leverages generated trace links to answer Program Comprehension Queries. We comparatively evaluate four different platforms for supporting industrial-scale tracing solutions, capable of tackling software projects with millions of artifacts. We demonstrate that tracing solutions built using big data frameworks scale well for large projects and that our Spark implementation outperforms relational database, graph database (GraphDB), and plain Java implementations. These findings contradict earlier results which suggested that GraphDB solutions should be adopted for large-scale tracing problems.
  5. Software traceability establishes associations between diverse software artifacts such as requirements, design, code, and test cases. Due to the non-trivial costs of manually creating and maintaining links, many researchers have proposed automated approaches based on information retrieval techniques. However, many globally distributed software projects produce software artifacts written in two or more languages. The use of intermingled languages reduces the efficacy of automated tracing solutions. In this paper, we first analyze and discuss patterns of intermingled language use across multiple projects, and then evaluate several different tracing algorithms including the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and various models that combine mono- and cross-lingual word embeddings with the Generative Vector Space Model (GVSM). Based on an analysis of 14 Chinese-English projects, our results show that the best performance is achieved using mono-lingual word embeddings integrated into GVSM with machine translation as a preprocessing step.
  6. In many regulated domains, traceability is established across diverse artifacts such as requirements, design, code, test cases, and hazards, either manually or with the help of supporting tools, and the resulting trace links are used to support activities such as impact analysis, compliance verification, and safety inspections. Automated tracing techniques need to leverage the semantics of underlying artifacts in order to establish more accurate trace links and to provide explanations of links that have been created in either a manual or automated fashion. To support this, we propose an automated technique which leverages source code, project artifacts, and an external domain corpus to generate a domain-specific concept model. We then use the generated concept model to improve traceability results and to provide explanations of the results. Our approach overcomes existing problems with deep-learning traceability algorithms, as it does not require a training set of existing trace links. Finally, as an initial proof-of-concept, we apply our semantically-guided approach to the Dronology project, and show that it improves over other tracing techniques that do not use a concept model.
  7. Comparative genomics has revealed common occurrences in karyotype evolution such as chromosomal end-to-end fusions and insertions of one chromosome into another near the centromere, as well as many cases of de novo centromeres that generate positional polymorphisms. However, how rearrangements such as dicentrics and acentrics persist without being destroyed or lost remains unclear. Here, we sought experimental evidence for the frequency and timeframe for inactivation and de novo formation of centromeres in maize (Zea mays). The pollen from plants with supernumerary B chromosomes was gamma-irradiated and then applied to normal maize silks of a line without B chromosomes. In ∼8,000 first-generation seedlings, we found many B–A translocations, centromere expansions, and ring chromosomes. We also found many dicentric chromosomes, but a fraction of these show only a single primary constriction, which suggests inactivation of one centromere. Chromosomal fragments were found without canonical centromere sequences, revealing de novo centromere formation over unique sequences; these were validated by immunolocalization with Thr133-phosphorylated histone H2A, a marker of active centromeres, and chromatin immunoprecipitation-sequencing with the CENH3 antibody. These results illustrate the regular occurrence of centromere birth and death after chromosomal rearrangement during a narrow window of one to potentially only a few cell cycles for the rearranged chromosomes to be recognized in this experimental regime.
  8. Software projects produce large quantities of data such as feature requests, requirements, design artifacts, source code, tests, safety cases, release plans, and bug reports. If leveraged effectively, this data can be used to provide project intelligence that supports diverse software engineering activities such as release planning, impact analysis, and software analytics. However, project stakeholders often lack the skills to formulate the complex queries needed to retrieve, manipulate, and display the data in meaningful ways. To address these challenges we introduce TiQi, a natural language interface that allows users to express software-related queries in spoken or written natural language. TiQi is a web-based tool. It visualizes available project data as a prompt to the user, accepts Natural Language (NL) queries, transforms those queries into SQL, and then executes the queries against a centralized or distributed database. Raw data is stored either directly in the database or retrieved dynamically at runtime from case tools and repositories such as GitHub and Jira. The transformed query is visualized back to the user as SQL and augmented UML, and raw data results are returned. Our tool demo can be found on YouTube at the following link: http://tinyurl.com/TIQIDemo.
  9. Haspin‐mediated phosphorylation of histone H3 at threonine 3 (H3T3ph) promotes proper deposition of Aurora B at the inner centromere to ensure faithful chromosome segregation in metazoans. However, the function of H3T3ph remains relatively unexplored in plants. Here, we show that in maize (Zea mays L.) mitotic cells, H3T3ph is concentrated at pericentromeric and centromeric regions. Additional weak H3T3ph signals occur between cohered sister chromatids at prometaphase. Immunostaining on dicentric chromosomes reveals that an inactive centromere cannot maintain H3T3ph at metaphase, indicating that a functional centromere is required for H3T3 phosphorylation. H3T3ph locates at a newly formed centromeric region that lacks detectable CentC sequences and strongly reduced CRM and ZmBs repeat sequences at metaphase II. These results suggest that centromeric localization of H3T3ph is not dependent on centromeric sequences. In maize meiocytes, H3T3 phosphorylation occurs at the late diakinesis and extends to the entire chromosome at metaphase I, but is exclusively limited to the centromere at metaphase II. The H3T3ph signals are absent in the afd1 (absence of first division) and sgo1 (shugoshin) mutants during meiosis II when the sister chromatids exhibit random distribution. Further, we show that H3T3ph is mainly located at the pericentromere during meiotic prophase II but is restricted to the inner centromere at metaphase II. We propose that this relocation of H3T3ph depends on tension at the centromere and is required to promote bi‐orientation of sister chromatids.
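
The T-BERT record above reports its gains in Mean Average Precision (MAP). As a reading aid, the following is a minimal sketch of how MAP is commonly computed for ranked trace links; it is not taken from the paper. The function names, the issue and commit identifiers, and the choice to score a query with no true links as 0 are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked candidate list against its true links."""
    if not relevant:
        # Assumption: a query with no true links scores 0 (some studies skip it instead).
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, artifact in enumerate(ranked, start=1):
        if artifact in relevant:
            hits += 1
            # Precision at this rank, counted only where a true link is found.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in `rankings`."""
    scores = [average_precision(rankings[q], truth.get(q, set())) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two issues, each with a ranked list of candidate commits.
rankings = {
    "issue-1": ["c3", "c1", "c2"],
    "issue-2": ["c2", "c4", "c1"],
}
truth = {
    "issue-1": {"c1"},
    "issue-2": {"c1", "c2"},
}
map_score = mean_average_precision(rankings, truth)
print(f"MAP = {map_score:.4f}")
```

In this toy data the single true link for issue-1 is ranked second, while issue-2 has true links at ranks one and three, so the score rewards rankings that place true links near the top.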